Analyzing and identifying multiword expressions in spoken language
Authors: H. Strik, M. Hulsbosch, C. Cucchiarini
Abstract
The present paper investigates multiword expressions (MWEs) in spoken language and possible ways of identifying MWEs automatically in speech corpora. Two MWEs that emerged from previous studies and that occur frequently in Dutch are analyzed to study their pronunciation characteristics and compare them to those of other utterances in a large speech corpus. The analyses reveal that these MWEs display extreme pronunciation variation and reduction, i.e., many phonemes and even syllables are deleted. Several measures of pronunciation reduction are calculated for these two MWEs and for all other utterances in the corpus. Five of these measures are more than twice as high for the MWEs, thus indicating considerable reduction. One overall measure of pronunciation deviation is then calculated and used to automatically identify MWEs in a large speech corpus. The results show that neither this overall measure nor frequency of co-occurrence alone is suitable for identifying MWEs. The best results are obtained by using a metric that combines overall pronunciation reduction with weighted frequency. In this way, recurring “islands of pronunciation reduction” that contain (potential) MWEs can be identified in a large speech corpus.

H. Strik, M. Hulsbosch, C. Cucchiarini
Department of Linguistics (Section Language and Speech), Radboud University, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands
URL: http://lands.let.ru.nl/~strik/
Present address (H. Strik): Erasmus Building, room 8.14, Erasmusplein 1, 6525 HT Nijmegen, The Netherlands
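The abstract does not give the formula for the combined metric, so the following is only a minimal sketch of the general idea: score each recurring word sequence by its average pronunciation reduction multiplied by a damped frequency term. The length-based reduction proxy, the `log(1 + count)` weighting, and the names `reduction` and `score_ngrams` are all assumptions made for illustration, not the authors' definitions, and the tokens are toy data.

```python
import math
from collections import defaultdict


def reduction(canonical, realized):
    """Assumed proxy: share of canonical phonemes missing in the realized form."""
    if not canonical:
        return 0.0
    return max(0.0, 1.0 - len(realized) / len(canonical))


def score_ngrams(tokens, n=2):
    """Score every word n-gram in `tokens`.

    tokens: list of (word, canonical_phonemes, realized_phonemes) tuples.
    Returns {ngram: mean_reduction * log(1 + count)} (hypothetical weighting).
    """
    stats = defaultdict(lambda: [0, 0.0])  # ngram -> [count, summed reduction]
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        ngram = tuple(word for word, _, _ in window)
        canonical = [p for _, canon, _ in window for p in canon]
        realized = [p for _, _, real in window for p in real]
        stats[ngram][0] += 1
        stats[ngram][1] += reduction(canonical, realized)
    return {
        ngram: (total / count) * math.log(1 + count)
        for ngram, (count, total) in stats.items()
    }


if __name__ == "__main__":
    # Toy tokens: placeholder words and phoneme symbols, not corpus data.
    toy = [
        ("a", ["x", "y"], ["x"]),
        ("b", ["z", "w"], ["z"]),
        ("c", ["k"], ["k"]),
        ("a", ["x", "y"], ["x"]),
        ("b", ["z", "w"], ["z"]),
    ]
    for ngram, score in sorted(score_ngrams(toy).items(), key=lambda kv: -kv[1]):
        print(ngram, round(score, 3))
```

Under this sketch, a sequence ranks highly only when it is both strongly reduced and recurrent, which mirrors the abstract's finding that neither reduction nor co-occurrence frequency alone is sufficient.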
Similar resources
Parsing Models for Identifying Multiword Expressions
Multiword expressions lie at the syntax/semantics interface and have motivated alternative theories of syntax like Construction Grammar. Until now, however, syntactic analysis and multiword expression identification have been modeled separately in natural language processing. We develop two structured prediction models for joint parsing and multiword expression identification. The first is base...
Vague Language and Interpersonal Communication: An Analysis of Adolescent Intercultural Conversation
This paper is concerned with the analysis of the spoken language of teenagers, taken from a newly developed specialised corpus the British and Taiwanese Teenage Intercultural Communication Corpus (BATTICC). More specifically, the study employs a discourse analytical approach to examine vague language in an intercultural context among a group of British and Taiwanese adolescents, paying particul...
USzeged: Identifying Verbal Multiword Expressions with POS Tagging and Parsing Techniques
The paper describes our system submitted for the Workshop on PARSEME’s Shared Task on automatic identification of verbal multiword expressions. It uses POS tagging and dependency parsing to identify single- and multi-token verbal MWEs in text. Our system is language-independent and competed on nine of the eighteen languages. Our paper describes how our system works and gives its error analysis f...
Semantic Clustering: an Attempt to Identify Multiword Expressions in Bengali
One of the key issues in both natural language understanding and generation is the appropriate processing of Multiword Expressions (MWEs). MWE can be defined as a semantic issue of a phrase where the meaning of the phrase may not be obtained from its constituents in a straightforward manner. This paper presents an approach of identifying bigram noun-noun MWEs from a medium-size Bengali corpus b...
Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources
We propose a framework for using multiple sources of linguistic information in the task of identifying multiword expressions in natural language texts. We define various linguistically motivated classification features and introduce novel ways for computing them. We then manually define interrelationships among the features, and express them in a Bayesian network. The result is a powerful class...
Journal: Language Resources and Evaluation
Volume: 44
Publication date: 2010